Project Name: Hotel Booking Demand

DSC 478 Final Project

NAWAAZ SHARIF (2015155)

SYED NOOR RAZI ALI (2070326)

MOHAMMED RASHIDUDDIN (2070301)

Data Pre-Processing

Cleaning the Dataset

Histograms

The Distribution of all Categorical Variables in Histograms

The above histograms represent all the categorical variables available in our dataset.

From the histograms we can get the basic information such as:

1. AGENT: which agent gets the highest number of bookings and which does not.

2. ARRIVAL_DATE_YEAR: which year had the highest number of guests; in our case, 2016 had the most guests.

3. ARRIVAL_DATE_DAY_OF_MONTH: which day of the month had the highest number of guests.

4. ARRIVAL_DATE_WEEK_NUMBER: which week of the year had the highest number of guests.

5. ARRIVAL_DATE_MONTH: we can see that the highest numbers of guests arrived in May, June, July and August.

6. IS_REPEATED_GUEST: how many of our guests are repeat guests; in our case, not many are.

7. DISTRIBUTION_CHANNEL: in our dataset, most guests book their hotels through Travel Agents (TA) and Tour Operators (TO).

8. IS_CANCELLED: how many of our guests cancelled their stay, which is likely to be helpful later on.

9. HOTEL: more guests prefer the city hotel over the resort hotel.

10. MEAL: we can see that most of our guests prefer the BB meal.

11. COUNTRY: most of our guests are from Portugal (PRT), which makes sense as the hotels are located in Portugal itself.

12. MARKET_SEGMENT: most of our bookings come from Online Travel Agents.

13. RESERVED_ROOM_TYPE / ASSIGNED_ROOM_TYPE: our guests prefer room type A over the others.

14. DEPOSIT_TYPE: most bookings do not have prior deposits.

15. CUSTOMER_TYPE: most of our customers are of the transient type.
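The per-column distribution plots behind these observations can be sketched with pandas and matplotlib. The frame below is a tiny hypothetical stand-in for the real dataset, not the project's actual code:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Tiny hypothetical stand-in for the hotel bookings frame
df = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel"],
    "arrival_date_year": [2015, 2016, 2016],
    "is_canceled": [0, 1, 0],
})

cat_cols = ["hotel", "arrival_date_year", "is_canceled"]
fig, axes = plt.subplots(1, len(cat_cols), figsize=(12, 3))
for ax, col in zip(axes, cat_cols):
    # value_counts gives one bar per category level
    df[col].value_counts().plot(kind="bar", ax=ax, title=col)
fig.tight_layout()
```

The same loop extends to all fifteen categorical columns listed above by adding their names to `cat_cols`.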

The top 10 values of the continuous variables.

Correlation Map for Continuous Variables.

From the above correlation matrix, we can see that only a few feature pairs have correlations above 0.1, such as ADR with adults, children and babies. ADR correlates weakly with adults (0.2), moderately with children (0.33), and barely with babies (0.029). The strongest correlation is between stays_in_week_nights and stays_in_weekend_nights, at 0.49, the highest among all the variables.
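A correlation map like the one described can be built from `DataFrame.corr()` and `imshow`. This sketch uses random stand-in data, so the correlation values will not match the report's:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Random stand-in for the continuous columns (values are illustrative only)
df = pd.DataFrame(rng.normal(size=(200, 4)),
                  columns=["lead_time", "adr", "adults", "children"])

corr = df.corr()  # pairwise Pearson correlations, values in [-1, 1]
fig, ax = plt.subplots()
im = ax.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im)
```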

Top 15 countries where guests came to reside.

From the above bar chart we can infer that most of the guests came from Portugal. The counts then drop sharply, with a gap of roughly 40,000 between the top country and the second. The next countries of origin are Great Britain, France, Spain and Germany, each with around 10,000 guests.

Visual Representation of the Hotel Column in a Pie Chart

From the above pie chart we can infer that guests most often stayed at the city hotel: 66.4% of total guests stayed at the city hotel, compared to 33.6% at the resort hotel.

We can also infer that the city hotel is in higher demand.

Statistical Map showing the Nationality of Visitors from Around the World.

The geographical areas where guests came from are coloured.

We have tried a new visualization to show the nationality of the visitors around the world; it clearly highlights the geographical locations the guests came from. As we can see from the above choropleth, most of the visitors were from European countries, with the highest number from Portugal.

Boxplots and Histograms of Continuous Variables

We have plotted a histogram and a box plot for each of the continuous variables in our dataset.

  1. From the lead-time box plot we can infer that most of the outliers fall between 400 and 700; similarly, the lead-time histogram shows that most of the data lies between 0 and 400. The median is near the lower quartile of the box plot, so lead time is right-skewed.

  2. For the stays_in_weekend_nights box plot, most of the outliers fall between 5 and 17.5; its histogram shows the data concentrated on the left, between 0 and 2.5. The median sits near the middle of the box, so there is little skewness in stays_in_weekend_nights.

  3. For stays_in_week_nights, the box is very small, the lower whisker is very close to the lower quartile, the median is at the centre of the box, and the upper whisker is long. The histogram shows the data ranging from 0 to 10, with a majority of outliers above 10.

  4. For adults, as well as for children, babies, previous_cancellations, previous_bookings_not_canceled, booking_changes, days_in_waiting_list and required_car_parking_spaces, the box itself is barely visible.

  5. For the ADR box plot, the lower and upper whiskers are about the same length and the median sits in the middle of the box, so there is no skewness. The histogram tells the same story, with most of the data between 0 and 300; the box plot places the outliers roughly between 200 and 500.

  6. For the total_of_special_requests box plot, there is no lower whisker, and a few outliers appear at 3, 4 and 5.

  7. Overall, most of the variables have outliers, and some box plots are missing the lower whisker or even a visible median. Since the city hotel is cheaper than the resort hotel, we see a more diverse mix of guests residing in the city hotel than in the more expensive resort hotel.
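A grid of box plots over matching histograms like the ones discussed can be sketched as follows. The columns are synthetic stand-ins (lead_time is drawn from an exponential distribution so it comes out right-skewed, as in the report):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Hypothetical continuous columns standing in for the real dataset
df = pd.DataFrame({
    "lead_time": rng.exponential(100, 500),            # right-skewed
    "adr": rng.normal(100, 40, 500).clip(min=0),       # non-negative rates
    "stays_in_week_nights": rng.poisson(2, 500).astype(float),
})

fig, axes = plt.subplots(2, len(df.columns), figsize=(12, 6))
for i, col in enumerate(df.columns):
    axes[0, i].boxplot(df[col])        # top row: box plots (outliers as points)
    axes[0, i].set_title(col)
    axes[1, i].hist(df[col], bins=30)  # bottom row: matching histograms
fig.tight_layout()
```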

Representing the univariate dataset using graphical data analysis in one dimension.

The plots above highlight differences between the resort hotel and city hotel over several features, such as lead_time, stays_in_weekend_nights, stays_in_week_nights, adults, children, babies, previous_cancellations, previous_bookings_not_canceled, booking_changes, days_in_waiting_list, adr, required_car_parking_spaces and total_of_special_requests.

Visualizing which year there was more booking or more cancellations

From the above bar plots, we can see the bookings received between the years 2015 and 2017. The bar graph also shows whether bookings were cancelled and how many were cancelled in each year.

Plotting Pie Charts for assigned_room_type, market_segment, meal, is_repeated_guest

Using pie charts, we show the distribution of several features: for assigned room type, most bookings were assigned room C, with room A the next most preferred; for market segment, most bookings were direct and corporate; for meal, most guests prefer the BB and FB meals; and for repeated guests, we can see that few of our guests are repeat visitors.

Cross-tabulating hotel, is_repeated_guest and is_cancelled.

Using crosstab method, we plot the graphs between:

hotel and is_cancelled: we can see which hotel has how many cancellations.

hotel and is_repeated_guest: we can see which hotel has how many repeated guests.

arrival_date_month and is_cancelled: which month has the highest number of cancellations; in our case, August has the highest number of bookings as well as cancellations.
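The cross-tabulations above can be produced with `pd.crosstab`; a minimal sketch on a hypothetical four-row frame:

```python
import pandas as pd

# Hypothetical mini-frame standing in for the bookings data
df = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel"],
    "is_canceled": [1, 0, 0, 0],
})

# Rows are hotels, columns are cancellation flags, cells are counts
ct = pd.crosstab(df["hotel"], df["is_canceled"])
```

The table can then be plotted directly, e.g. `ct.plot(kind="bar")`, which is one way to get grouped bars like those in the figures.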

Converting the labels to numerical form [converting categorical to numerical variables]
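One common way to do this conversion is scikit-learn's `LabelEncoder`; a minimal sketch (the project's exact encoding step is not shown, so treating each column this way is an assumption):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"hotel": ["City Hotel", "Resort Hotel", "City Hotel"]})

le = LabelEncoder()
# Classes are sorted alphabetically, then mapped to 0, 1, 2, ...
df["hotel"] = le.fit_transform(df["hotel"])
```

The same encoder would be applied per categorical column; `le.classes_` keeps the mapping so predictions can be translated back to the original labels.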

Classification

Techniques [KNN, Decision Tree, Multinomial Naive Bayes, SVM, Random Forest, Neural Network and Gradient Descent]

Hotel is the Target Variable.

KNN

For cross validation we are using Grid Search with cv=10

Considering the best value for KNN and applying it on test dataset.

When we use KNN as the classifier and hotel as the target variable, we see an accuracy of 95%, which means the model was able to predict whether a booking was for the city hotel or the resort hotel 95% of the time.
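A grid search over k with 10-fold cross validation, as described above, can be sketched like this. The data is synthetic, so the resulting accuracy will not match the report's 95%, and the candidate k values are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the encoded bookings features
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Grid search with 10-fold cross validation over the number of neighbours
grid = GridSearchCV(KNeighborsClassifier(),
                    {"n_neighbors": list(range(1, 16, 2))}, cv=10)
grid.fit(X_tr, y_tr)

acc = grid.score(X_te, y_te)  # accuracy of the best k on the held-out split
```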

Confusion Matrix for KNN

From the above confusion matrix, we can infer that most bookings were predicted as the city hotel, since the majority of the data points in our dataset belong to the city hotel.

Decision Tree

Classification using a Decision Tree with the gini and entropy criteria and min_samples_split from 10 to 90.

We got an accuracy of 92% using the decision tree classifier, which tells us that the model was able to distinguish between the resort hotel and the city hotel, though less accurately than the KNN classifier.
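The gini/entropy search with min_samples_split from 10 to 90 can be sketched as follows. The data is synthetic and the fold count is an assumption; the parameter grid follows the text above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Search over both split criteria and min_samples_split values 10..90
params = {"criterion": ["gini", "entropy"],
          "min_samples_split": list(range(10, 100, 10))}
grid = GridSearchCV(DecisionTreeClassifier(random_state=0), params, cv=5)
grid.fit(X, y)

best = grid.best_params_  # winning criterion and split threshold
```

The fitted tree can then be drawn with `sklearn.tree.plot_tree(grid.best_estimator_)`, which is one way to get a visualization like the one below.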

Visualizing the Decision Tree

RMSE for Decision Tree

Confusion Matrix for Decision Tree

From the above confusion matrix, we can infer that around 1,800 observations were predicted incorrectly, which reflects the slightly lower accuracy compared to the KNN classifier.

Multinomial Naive Bayes

Using grid search for cross validation.

Using the best parameters on the test dataset.

For Multinomial Naive Bayes, we can infer that the accuracy score is the lowest among all the classification models so far: 79%.
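A grid search for Multinomial Naive Bayes can be sketched as below. The `alpha` smoothing grid is an assumption, since the report does not list the searched parameters, and the count-like synthetic features satisfy the model's non-negativity requirement:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
# MultinomialNB expects non-negative (count-like) features
X = rng.integers(0, 10, size=(300, 6))
y = rng.integers(0, 2, size=300)

# Hypothetical smoothing grid; cross-validated with 5 folds
grid = GridSearchCV(MultinomialNB(), {"alpha": [0.1, 0.5, 1.0]}, cv=5)
grid.fit(X, y)
```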

RMSE of Multinomial Naive Bayes

Confusion Matrix for Multinomial Naive Bayes

From the confusion matrix for Multinomial Naive Bayes, we can see that around 5,000 observations were labeled incorrectly.

SVM

Performing grid search as the cross-validation technique over the gamma, C and kernel parameters.

Using the best parameters to perform on test dataset.

For SVM we see an accuracy of 96%, which means that about 96% of the hotels were predicted accurately.
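The gamma/C/kernel grid search can be sketched as follows; the data is synthetic and the specific grid values are assumptions, since the report names the searched parameters but not their candidate values:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Search over gamma, C and kernel, as described in the report
params = {"gamma": ["scale", 0.01],
          "C": [0.1, 1, 10],
          "kernel": ["rbf", "linear"]}
grid = GridSearchCV(SVC(), params, cv=3)
grid.fit(X_tr, y_tr)

acc = grid.score(X_te, y_te)  # accuracy of the best parameters on the test split
```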

RMSE value of SVM

Confusion matrix of SVM

From the confusion matrix, we can infer that around 150 observations were labeled incorrectly, the fewest among all the classification models.

ROC CURVE

The above ROC graph shows the performance of the different models with "hotel" as the target variable. KNN, Decision Tree and SVM performed well, with high AUC, whereas Multinomial Naive Bayes has low AUC. For our dataset we can choose both Decision Tree and SVM, as they both have an AUC of 0.96.
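An ROC curve with its AUC can be computed with `roc_curve` and `auc`. This sketch scores a single KNN model on synthetic data, whereas the report overlays several models on one graph:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = KNeighborsClassifier().fit(X_tr, y_tr)
probs = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

fpr, tpr, _ = roc_curve(y_te, probs)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label=f"KNN (AUC = {roc_auc:.2f})")
plt.plot([0, 1], [0, 1], "k--")  # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
```

Repeating the `roc_curve` call for each fitted model and plotting onto the same axes gives the multi-model comparison described above.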

Random Forest

Classification Report Random Forest

For random forest we see the highest accuracy among all the models, 99%, which means that about 99% of observations were correctly predicted as resort hotel or city hotel.

RMSE Value of Random Forest.

Confusion Matrix of Random Forest

From the above confusion matrix, we can infer that around 300 observations were labeled incorrectly between the resort and city hotel.
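The random forest evaluation (accuracy, RMSE and confusion matrix) can be sketched as follows on synthetic data; the RMSE here is computed over the 0/1 class labels, which is one plausible reading of how the report's RMSE values were obtained:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
pred = clf.predict(X_te)

rmse = np.sqrt(mean_squared_error(y_te, pred))  # RMSE over 0/1 labels
cm = confusion_matrix(y_te, pred)               # rows: true class, cols: predicted
```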

Stochastic Gradient Descent

RMSE and Classification Report for Gradient Descent

Coefficients of SGD.

Confusion Matrix of SGD

From the above SGD classification model, we can say the accuracy is 70%, which is lower than the rest of the models; agent and adr are the coefficients that contribute most to the SGD model.
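The coefficient inspection for SGD can be sketched like this. Feature scaling is included because SGD is sensitive to feature scale; that preprocessing step is an assumption, as the report does not show it:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X = StandardScaler().fit_transform(X)  # SGD is sensitive to feature scale

clf = SGDClassifier(random_state=0).fit(X, y)

coef = clf.coef_[0]                 # one learned weight per feature
top = int(np.argmax(np.abs(coef)))  # index of the most influential feature
```

Pairing `coef` with the column names (e.g. as a sorted pandas Series) reproduces the "most contributing coefficient" reading used in the report.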

MLP Neural Networks

Confusion Matrix

RMSE Value of MLP

Classification Report of MLP

Visualizing the Confusion Matrix of MLP

We also see a high score when using the neural network as the classifier, with an accuracy of 98% and an RMSE value of 0.14.
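A scikit-learn multi-layer perceptron can be sketched on synthetic data as below; the hidden-layer size and iteration count are assumptions, as the report does not list the network's hyperparameters:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Hypothetical architecture: one hidden layer of 16 units
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0)
clf.fit(X_tr, y_tr)

acc = clf.score(X_te, y_te)  # held-out accuracy
```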

Comparing all the models performed with "hotel" as the target variable in decreasing order of Accuracy.

  1. Random Forest performed best, with an accuracy of 0.99, an RMSE value of 0.109 and a score of 0.9884.
  2. MLP has an accuracy of 0.98 and an RMSE value of 0.149.
  3. SVM has an accuracy of 0.96, an RMSE value of 0.191 and a score of 0.9588.
  4. KNN has an accuracy of 0.95, an RMSE value of 0.225 and a score of 0.946.
  5. Decision Tree has an accuracy of 0.92, an RMSE value of 0.279 and a score of 0.980.
  6. Multinomial Naive Bayes has an accuracy of 0.79, an RMSE value of 0.457 and a score of 0.7911.
  7. Gradient Descent has an accuracy of 0.70, an RMSE value of 0.321 and a score of 0.5556.

Classification

Techniques [KNN, Decision Tree, Multinomial Naive Bayes, SVM, Random Forest, Neural Network and Gradient Descent]

Using is_cancelled as the target variable

KNN

For cross validation we are using Grid Search with cv=10

Classification Report of KNN.

RMSE value of KNN

Confusion Matrix of KNN

When we consider is_cancelled as our target variable and apply the KNN classification model, we get an accuracy of 83%, and we see that similar numbers of cancelled and not-cancelled observations are labeled incorrectly.

Decision Tree

Classification using a Decision Tree with the gini and entropy criteria and min_samples_split from 10 to 90.

Classification Report of Decision Tree.

Visualizing the Decision Tree

RMSE Value of Decision Tree

Confusion Matrix of Decision Tree

We see an accuracy of 79% for the decision tree, a little less than our previous model, KNN. We have also visualized the tree to a depth of 5, and we infer that around 5,000 observations were labeled incorrectly between cancelled and not cancelled.

Multinomial Naive Bayes

Using grid search for cross validation.

Classification Report of Multinomial Naive Bayes

RMSE of Multinomial Naive Bayes

Confusion Matrix of Multinomial Naive Bayes

For Multinomial Naive Bayes we again see the lowest accuracy compared to the rest of the models, 75%, and the highest number of incorrectly labeled observations, around 6,000.

SVM

Using grid search for cross validation.

Classification Matrix of SVM

RMSE of SVM

Confusion Matrix of SVM

For SVM, with is_cancelled as the target variable, we see an accuracy of 82% and the fewest incorrectly labeled observations, around a thousand records.

ROC Curve

The above ROC graph shows the performance of the different models with "is_cancelled" as the target variable. Decision Tree and SVM performed well, with high AUC, whereas KNN and Multinomial Naive Bayes have low AUC. For our dataset we can choose both SVM and Decision Tree, as they both have an AUC of 0.82.

Random Forest

Classification Report of Random Forest

RMSE of Random Forest

Confusion Matrix of Random Forest

The model that performed best with is_cancelled as the target variable is the random forest, which achieved the highest accuracy of 90%. From the confusion matrix we see that around 2,500 observations were labeled incorrectly, with more mislabeled records in the cancelled class than the not-cancelled class; this is still an improvement over the rest of the classification models.

Stochastic Gradient Descent

RMSE value of SGD Regression

Confusion Matrix

For gradient descent we see the lowest accuracy, 68%; the coefficients contributing most to the SGD model are deposit_type and total_of_special_requests.

Neural Network

Confusion Matrix of Neural Network.

RMSE value of MLP

Classification Report of MLP

Confusion Matrix of MLP

When we run the neural network with is_cancelled as the target variable, we achieve an accuracy of around 82%, which means the model correctly separated cancelled from not-cancelled records in roughly 82% of cases.

Comparing all the models performed with "is_cancelled" as the target variable:

  1. Random Forest performed best, with an accuracy of 0.90, an RMSE value of 0.32 and a score of 0.894.
  2. KNN has an accuracy of 0.83, an RMSE value of 0.415 and a score of 0.946.
  3. MLP has an accuracy of 0.82 and an RMSE value of 0.424.
  4. SVM has an accuracy of 0.82, an RMSE value of 0.426 and a score of 0.828.
  5. Decision Tree has an accuracy of 0.79, an RMSE value of 0.460 and a score of 0.8589.
  6. Multinomial Naive Bayes has an accuracy of 0.75, an RMSE value of 0.497 and a score of 0.753.
  7. Gradient Descent has an accuracy of 0.68, an RMSE value of 0.39 and a score of 0.342.